Abstract

The question of the statistical significance of the difference in error
rates of two speech recognition algorithms is almost invariably ignored
in the literature. If it is considered, it is usually assumed that the
algorithms were tested on two independent test sets, whereas in reality,
they are normally tested on the same set. The Gillick Test is a simple
and elegant technique for deciding whether the difference between the
error rates of two algorithms tested on the same data is significant.
1  Introduction


Assessment is certainly the most neglected aspect of work on automatic speech
recognition, but it is vitally important. The literature currently abounds with
descriptions of new algorithms and techniques for speech recognition which show an
improvement over previous algorithms, but rarely, if ever, do they address the key
question of whether the obtained improvement in performance is statistically
significant. This memo describes a very simple test which enables comparison of two
speech recognisers tested on the same set of utterances, which is normal practice when
developing new algorithms. It is entirely based on unpublished notes by Larry Gillick
of Dragon Systems Inc.


2  Some  Preliminaries 


The Binomial distribution gives the probability of exactly k errors occurring in n trials
when the underlying probability of an error is e, i.e.:

    $$\Pr(k \text{ errors}) = B(n, e) = \binom{n}{k} e^k (1 - e)^{n-k}, \qquad k = 0, 1, \ldots, n$$

The expectation (mean) of the above Binomial distribution is ne and the variance is
ne(1 - e). When n is large, the Binomial distribution can be approximated by a Normal
distribution with mean ne and variance ne(1 - e), i.e.:

    $$B(n, e) \approx N(ne, ne(1 - e))$$

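The quality of the Normal approximation can be checked numerically. The following is a minimal sketch and not part of the original notes: the function names `binomial_pmf` and `normal_pdf` are hypothetical, the values n = 1400 and e = 0.05 are chosen purely for illustration, and the Binomial probability is evaluated through logarithms to avoid overflow for large n.

```python
import math


def binomial_pmf(k: int, n: int, e: float) -> float:
    """Probability of exactly k errors in n trials, error probability 0 < e < 1."""
    # log of the binomial coefficient C(n, k), computed via log-gamma
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_pmf = log_coeff + k * math.log(e) + (n - k) * math.log1p(-e)
    return math.exp(log_pmf)


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a Normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


if __name__ == "__main__":
    n, e = 1400, 0.05                      # illustrative values (assumed)
    mean, var = n * e, n * e * (1.0 - e)   # ne and ne(1 - e)
    for k in (60, 70, 80):
        exact = binomial_pmf(k, n, e)
        approx = normal_pdf(k, mean, var)
        print(f"k = {k}: Binomial {exact:.5f}, Normal approximation {approx:.5f}")
```

For n of this size the two columns should agree closely near the mean, which is the regime relied upon in the later sections.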

A result we shall use in section 4 is: if A and B are Normally distributed random
variables with expectations $E(A) = \mu_A$ and $E(B) = \mu_B$, then
$E(A \pm B) = \mu_A \pm \mu_B$. Furthermore, if A and B are independent,
$\mathrm{Var}(A \pm B) = \mathrm{Var}(A) + \mathrm{Var}(B)$, and $A \pm B$ is itself
Normally distributed. Finally, it is assumed that we are dealing with recognition of
isolated utterances and that no rejections are allowed (or alternatively, rejections are
counted as misclassifications). Hence the error rate of the recogniser is defined to be
the probability that it misclassifies an utterance.


3  Hypothesis  Testing 

Hypothesis testing is a standard way of quantifying the statistical significance of
differences between data produced by different processes. A null hypothesis, H0, is
proposed, the data is analysed and H0 is accepted or rejected at a certain level of
significance. Suppose our two speech recognisers are R1 and R2; the null hypothesis
(H0) is:

    R1 and R2 have the same underlying (but unknown) error rate.

If subsequent analysis of the data showed that we should reject H0 at the 0.1% level,
this means that if H0 were in fact true, we would observe a discrepancy between the
error rates equal to or greater than that actually observed on only 0.1% of occasions.
Note that rejection of H0 does not strictly tell us which recogniser is better, but it is
safe to take the commonsense view here. A useful introduction to hypothesis testing
and the use of standard tables (see next section) is given in [1].
4  Testing  on  Independent  Test  Sets

Firstly, we consider the case where R1 and R2 are tested on two independent test sets,
each of size n utterances. This introduces some of the statistical ideas which are used
in section 5, where they are tested on the same data.

Suppose the underlying (but unknown) error rate of recogniser R1 is $e_1$ and R1
makes $X_1$ errors on its test set. Then from section 2:

    $$X_1 \sim B(n, e_1) \tag{1}$$

    $$X_1 \approx N(n e_1, n e_1 (1 - e_1)) \tag{2}$$

The best estimate $\hat{e}_1$ of $e_1$ is:

    $$\hat{e}_1 = \frac{X_1}{n} \tag{3}$$

so using equations 2 and 3, $\hat{e}_1$ will be Normally distributed:

    $$\hat{e}_1 \approx N\left(e_1, \frac{e_1 (1 - e_1)}{n}\right) \tag{4}$$

Similarly for R2, which has error rate $e_2$:

    $$\hat{e}_2 \approx N\left(e_2, \frac{e_2 (1 - e_2)}{n}\right) \tag{5}$$


The key to testing H0 is to consider the mean and variance of the random variable
$\hat{e}_1 - \hat{e}_2$ (see footnote 1). Applying the result stated in section 2 for
random variables A and B to equations 4 and 5 gives:


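    $$\hat{e}_1 - \hat{e}_2 \approx N\left(e_1 - e_2, \frac{e_1 (1 - e_1)}{n} + \frac{e_2 (1 - e_2)}{n}\right)$$

If H0 holds, $e_1 = e_2 = e$ (say) and:

    $$\hat{e}_1 - \hat{e}_2 \approx N\left(0, \frac{2e(1 - e)}{n}\right) \quad \text{if H0 holds} \tag{6}$$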


The probability of observing the measured value of $\hat{e}_1 - \hat{e}_2$ from a Normal
distribution with zero mean and variance 2e(1 - e)/n then tells us at what level of
statistical significance to accept or reject H0. This probability is easily found by
consulting standard statistical tables. Note that in estimating the variance, e is
unknown and can be estimated as $\bar{e}$ (the average estimated error)
$= (X_1 + X_2)/2n$.


Footnote 1: The same idea was used in [2].

4.1  An  Example  using  Independent  Test  Sets

Let us take an example to illustrate this. In a recent test, it was found that two
recognisers R1 and R2 gave 72 and 62 errors respectively on a test set of size 1400
utterances (for the purposes of this calculation, of course, we pretend that the
recognisers were tested on two independent test sets each of size 1400 utterances).
The tables of the cumulative Normal distribution refer to a distribution with zero mean
and unit variance, so a datapoint x from a distribution with mean $\mu$ and standard
deviation $\sigma$ is normalised to z where:

    $$z = \frac{x - \mu}{\sigma}$$

Hence we compute:

    $$z = \frac{|\hat{e}_1 - \hat{e}_2|}{\sqrt{2\bar{e}(1 - \bar{e})/n}} \tag{7}$$

where $\bar{e}$ is the average estimated error rate $(X_1 + X_2)/2n$.


Notice that $|\hat{e}_1 - \hat{e}_2|$ is computed, because to accept or reject H0, we are
only interested in the distance of $\hat{e}_1 - \hat{e}_2$ from zero and not whether
$\hat{e}_1 > \hat{e}_2$ or vice versa. Accordingly, we require the probability P that a
point falls outside z on either side of the mean; this probability is shown as the
shaded area in Fig 1:


Fig  1:  A  two-tailed  test  on  a  Normal  distribution  with  zero  mean 

We therefore use the 'two-tailed' tables of the cumulative Normal distribution, and
putting the above figures into equation 7, find z = 0.88531 and hence P = 0.376 (see
footnote 2). This means that if H0 is assumed (i.e. the underlying error rates are
equal), we would expect a difference between two observed error rates equal to or
greater than that actually observed on 37.6% of occasions. In other words, there is a
very good chance that the underlying error rates are equal and all we have observed is
a random effect. It will be seen in section 5.1 that the extra information provided
when the recognisers are tested on the same data may greatly increase the significance
of the result.
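
The arithmetic of this example can be reproduced with a few lines of Python. This is a minimal sketch rather than anything taken from the original notes: the function names `two_tailed_p` and `independent_sets_test` are hypothetical, and `math.erfc` stands in for the standard tables, using the identity P = erfc(z / sqrt(2)) for the two-tailed probability of a standard Normal deviate.

```python
import math


def two_tailed_p(z: float) -> float:
    """Probability that a standard Normal deviate falls outside +/- |z|."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def independent_sets_test(x1: int, x2: int, n: int) -> tuple[float, float]:
    """Return (z, P) for error counts x1 and x2 on two independent sets of size n."""
    e1_hat = x1 / n                      # estimated error rate of R1
    e2_hat = x2 / n                      # estimated error rate of R2
    e_bar = (x1 + x2) / (2 * n)          # average estimated error rate
    std = math.sqrt(2.0 * e_bar * (1.0 - e_bar) / n)  # std. dev. under H0
    z = abs(e1_hat - e2_hat) / std
    return z, two_tailed_p(z)


if __name__ == "__main__":
    z, p = independent_sets_test(72, 62, 1400)
    print(f"z = {z:.5f}, P = {p:.3f}")
```

Called with 72 and 62 errors on 1400 utterances, the sketch should agree with the values of z and P quoted above to the precision given.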

5  Testing  on  the  Same  Data  Set 

Consider the more realistic situation where the test set consists of a single set of
utterances $U_1, U_2, \ldots, U_n$. For any utterance $U_i$, define the following
probabilities:

    $q_{00}$ = Pr(R1 classifies $U_i$ correctly, R2 classifies $U_i$ correctly)

    $q_{01}$ = Pr(R1 classifies $U_i$ correctly, R2 classifies $U_i$ incorrectly)

    $q_{10}$ = Pr(R1 classifies $U_i$ incorrectly, R2 classifies $U_i$ correctly)

    $q_{11}$ = Pr(R1 classifies $U_i$ incorrectly, R2 classifies $U_i$ incorrectly)

Footnote 2: The NAG library function S015ABF returns 1 - P/2.
These probabilities can be visualised more easily in the following table form:

                            R2
                    Right         Wrong
    R1   Right      $q_{00}$      $q_{01}$
         Wrong      $q_{10}$      $q_{11}$

Table 1: Joint probability of correct decision or error for two speech recognisers tested
on the same data


It is clear that:

    $$e_1 = q_{10} + q_{11}$$

    $$e_2 = q_{01} + q_{11}$$

If H0 holds, $e_1 = e_2$, so $q_{01} = q_{10}$. Let:

    $$q = \frac{q_{10}}{q_{01} + q_{10}} \tag{8}$$


Then if H0 holds, q = 1/2. Equation 8 may be interpreted as follows: $q_{01} + q_{10}$
is the probability that only one of the recognisers makes an error on a given utterance;
hence q is the probability that R1 makes an error on a given utterance given that only
one of the recognisers makes an error.

Of course the $q_{ij}$ probabilities are computable only with an infinite test set.
However, we can estimate them from our finite test set. Define:


    $n_{00}$ = No. of utterances which R1 classifies correctly, R2 classifies correctly

    $n_{01}$ = No. of utterances which R1 classifies correctly, R2 classifies incorrectly

    $n_{10}$ = No. of utterances which R1 classifies incorrectly, R2 classifies correctly

    $n_{11}$ = No. of utterances which R1 classifies incorrectly, R2 classifies incorrectly

Once again, this is more easily visualised as:

                            R2
                    Right         Wrong
    R1   Right      $n_{00}$      $n_{01}$
         Wrong      $n_{10}$      $n_{11}$

Table 2: Distribution of numbers of correct decisions or errors for two speech
recognisers tested on the same data

Now:

    $n_{01} + n_{10}$ = No. of utterances on which only one recogniser makes an error
    $n_{10}$ = No. of utterances on which R1 makes an error, R2 classifies correctly
    q = Pr(R1 makes an error given that only one recogniser makes an error)

These three statements should make it clear that:

    $$n_{10} \sim B(n_{01} + n_{10}, q)$$

    $$n_{10} \approx N\left(q(n_{01} + n_{10}), q(1 - q)(n_{01} + n_{10})\right)$$
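The best estimate of q is $\hat{q}$ where:

    $$\hat{q} = \frac{n_{10}}{n_{01} + n_{10}}$$

If H0 holds, q = 1/2, so $\hat{q}$ is approximately Normally distributed with mean 1/2
and variance $\frac{1}{4(n_{10} + n_{01})}$, and we compute:

    $$z = \frac{|\hat{q} - 1/2|}{\sqrt{\frac{1}{4(n_{10} + n_{01})}}} \tag{10}$$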


5.1  Examples  using  the  Same  Test  Set 

We can now drop the pretence of section 4.1 of two independent test sets and repeat
the calculation on the basis that R1 and R2 were tested on the same test set. The
distribution of errors from this test was:


                            R2
                    Right     Wrong
    R1   Right      2721      7
         Wrong      17        55

Hence $n_{01} = 7$, $n_{10} = 17$, $\hat{q} = 0.70833$, z = 2.041 and P = 0.0412. If H0
were true, error patterns indicating a discrepancy between the error rates as large as
or larger than this would be observed on only just over 4% of occasions (c.f. 37.6% of
occasions if independent test sets are assumed), so there is quite a good chance that a
genuine difference exists.

It is instructive to compare the values of P for different error patterns. For instance,
suppose that R1 and R2 made the same numbers of errors as above but the error pattern
was:

                            R2
                    Right     Wrong
    R1   Right      2721      62
         Wrong      72        0

Here, P = 0.3876 so there is very little evidence for a difference between the recognisers.
Consider another error pattern:

                            R2
                    Right     Wrong
    R1   Right      2721      0
         Wrong      10        62

P = 0.00517, convincing evidence for a difference.

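The same-data calculation is short enough to sketch in Python. This is a minimal sketch under stated assumptions, not the author's own code: the function name `gillick_test` is hypothetical, the statistic is taken to be z = |q_hat - 1/2| / sqrt(1 / (4(n01 + n10))), which reproduces the z and P values quoted for the first two error patterns, and `math.erfc` replaces the two-tailed Normal tables.

```python
import math


def gillick_test(n01: int, n10: int) -> tuple[float, float, float]:
    """Return (q_hat, z, P) from the two disagreement counts.

    n01: utterances R1 classifies correctly and R2 incorrectly
    n10: utterances R1 classifies incorrectly and R2 correctly
    """
    k = n01 + n10                        # utterances where exactly one recogniser errs
    if k == 0:
        raise ValueError("the recognisers never disagree; the test is undefined")
    q_hat = n10 / k                      # estimate of q
    z = abs(q_hat - 0.5) / math.sqrt(1.0 / (4.0 * k))
    p = math.erfc(z / math.sqrt(2.0))    # two-tailed Normal probability
    return q_hat, z, p


if __name__ == "__main__":
    for n01, n10 in [(7, 17), (62, 72)]:   # first two error patterns above
        q_hat, z, p = gillick_test(n01, n10)
        print(f"n01 = {n01}, n10 = {n10}: q_hat = {q_hat:.5f}, z = {z:.3f}, P = {p:.4f}")
```

Only the off-diagonal counts are passed in, which mirrors the fact that the diagonal entries of the table play no part in the test.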

5.2  Comments  on  the  examples 

Notice that $n_{00}$ and $n_{11}$ are not considered in the calculations, so that
information on the relative performance of the classifiers is supplied only when they
disagree. A large value of $|n_{10} - n_{01}|$ implies a large $|\hat{q} - 1/2|$ and
hence a large z in equation 10, indicating the possibility of a genuine difference in
error rates; however, z is 'tempered' by the term $\frac{1}{4(n_{10} + n_{01})}$, which
is large when $n_{10} + n_{01}$ is small, reducing z and hence the significance of the
result. These observations tie up satisfyingly with one's intuitions about testing two
classifiers on the same data. It is worth mentioning that the more disjunct the error
pattern is (i.e. the higher the ratio $(n_{10} + n_{01})/n_{11}$), the greater the
improvement in performance obtainable by constructing a combined classifier (using some
means of arbitration when R1 and R2 disagree).

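The tempering effect can be illustrated numerically. In the minimal sketch below, the helper name `z_score` and the scaled-up counts are hypothetical; the proportion of disagreements going against R1 is held at 17/24 while the number of disagreements grows, and z is assumed to be |q_hat - 1/2| / sqrt(1 / (4(n01 + n10))).

```python
import math


def z_score(n01: int, n10: int) -> float:
    """z statistic computed from the two disagreement counts only."""
    k = n01 + n10
    return abs(n10 / k - 0.5) / math.sqrt(1.0 / (4.0 * k))


if __name__ == "__main__":
    # Same estimate of q in every case, but ten times more disagreements each time
    for n01, n10 in [(7, 17), (70, 170), (700, 1700)]:
        print(f"n01 + n10 = {n01 + n10}: z = {z_score(n01, n10):.3f}")
```

For a fixed estimate of q, z grows with the square root of the number of disagreements, so a given imbalance carries little weight when the recognisers disagree on only a handful of utterances.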

6  Discussion  and  Summary 

The Gillick test (actually an application of McNemar's test) puts the comparison of
two classifiers tested on the same data on a firm statistical footing. A feature of the
test is that it takes account only of utterances on which the classifiers disagree, an
obvious (but hitherto unexploited) strategy for a comparative test. Depending on the
distribution of errors, it can place a much higher statistical significance on the
difference in the error rates than that given by the (almost always incorrect)
assumption of two independent test sets. It is very simple to apply and it is
recommended that it be used whenever two recognisers are tested on the same data.